School Connect: Intro to DS & AI
Indian Institute of Technology, Madras
Player | Team | Batting average (Ave) | Strike rate (SR) |
---|---|---|---|
Ibrahim Zadran | Afghanistan | 28.87 | 107.44 |
Rahmanullah Gurbaz | Afghanistan | 35.12 | 124.33 |
JC Buttler | England | 42.8 | 158.51 |
PD Salt | England | 37.6 | 159.32 |
Bag-of-Words approach
Observe that sentences 1 and 3 are more similar to each other than either is to sentence 2.
We must make precise what we mean by similarity.
A possible definition:
Sentence Similarity
Given two sentences, the similarity between them can be defined as the number of distinct words they have in common.
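The definition above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function name `sentence_similarity` is hypothetical, and it assumes that words are separated by spaces, that case is ignored, and that a repeated word is counted only once. The three sentences are the ones used in these slides.

```python
def sentence_similarity(a: str, b: str) -> int:
    """Count the distinct words that appear in both sentences."""
    # Assumption: words are separated by whitespace and case does not matter.
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    # The intersection of the two sets holds the words common to both.
    return len(words_a & words_b)


s1 = "quick brown fox jumps over the lazy dog"
s2 = "pen is mightier than the sword"
s3 = "lazy dog slept in the sun"

print(sentence_similarity(s1, s3))  # shared words: the, lazy, dog
print(sentence_similarity(s1, s2))  # shared word: the
```

Using sets keeps the sketch short, at the cost of ignoring how many times a word occurs.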
quick brown fox jumps over the lazy dog
lazy dog slept in the sun
quick brown fox jumps over the lazy dog
pen is mightier than the sword
The first pair shares the words "the", "lazy", and "dog"; the second pair shares only the word "the".
The similarity scores are therefore 3 and 1 respectively.
Bag-of-Words
Now suppose the Bag-of-Words features are given instead of the sentences.
Can the similarity scores (between sentences 1 and 3, and between sentences 2 and 3) still be computed?
Sentence | quick | brown | fox | jumps | over | the | lazy | dog | slept | in | sun | pen | is | mightier | than | sword |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
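One way the rows of this table could be produced is sketched below. The column order is copied from the table header; the helper name `bag_of_words` is hypothetical, and the sketch assumes each column records only whether a word is present (1) or absent (0), which matches the table because no word repeats within a sentence.

```python
# Column order taken from the table header.
vocabulary = [
    "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog",
    "slept", "in", "sun", "pen", "is", "mightier", "than", "sword",
]

sentences = [
    "quick brown fox jumps over the lazy dog",  # sentence 1
    "pen is mightier than the sword",           # sentence 2
    "lazy dog slept in the sun",                # sentence 3
]


def bag_of_words(sentence, vocabulary):
    """Return one 0/1 entry per vocabulary word (presence or absence)."""
    words = set(sentence.split())
    return [1 if word in words else 0 for word in vocabulary]


features = [bag_of_words(s, vocabulary) for s in sentences]
for number, row in enumerate(features, start=1):
    print(number, row)
```

Every row has as many entries as there are vocabulary words, regardless of how long the sentence is.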
Add up the columns in which both sentences have a 1:
similarity(1, 3) = 1 + 1 + 1 = 3 (columns "the", "lazy", "dog")
similarity(2, 3) = 1 (column "the")
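The column-by-column calculation can be written as a short sketch that works only with the feature rows, never with the original sentences. The dictionary name `bow` and the function name `similarity` are chosen here for illustration; the rows are copied from the table. Multiplying matching entries and adding the products is the same as taking the dot product of the two rows.

```python
# Bag-of-Words rows copied from the table, keyed by sentence number.
bow = {
    1: [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    3: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
}


def similarity(u, v):
    """Count the columns where both rows hold a 1."""
    # x * y is 1 only when both entries are 1, so the sum counts shared words.
    return sum(x * y for x, y in zip(u, v))


print(similarity(bow[1], bow[3]))
print(similarity(bow[2], bow[3]))
```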
Observe that, by construction of the Bag-of-Words features, the calculation of sentence similarity becomes straightforward: compare the two rows column by column.
All the resulting feature vectors have the same length. This is generally a desirable property.
With the ASCII representation, calculating sentence similarity would take more steps.
Given the Bag-of-Words features, is it possible to reconstruct the original sentence? Is it possible in the ASCII case?
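A small experiment may help when thinking about these questions. The two sentences below are made up for illustration and do not come from the slides, and the helper `bag_of_words` is hypothetical. The sketch compares their Bag-of-Words rows, then converts one sentence to ASCII codes and back.

```python
def bag_of_words(sentence, vocabulary):
    """Return one 0/1 entry per vocabulary word (presence or absence)."""
    words = set(sentence.split())
    return [1 if word in words else 0 for word in vocabulary]


# Hypothetical sentences: same words, different order, different meaning.
a = "the dog chased the cat"
b = "the cat chased the dog"

vocabulary = sorted(set(a.split()) | set(b.split()))
same_features = bag_of_words(a, vocabulary) == bag_of_words(b, vocabulary)
print(same_features)

# ASCII keeps every character in order, so the text can be rebuilt exactly.
ascii_codes = [ord(character) for character in a]
restored = "".join(chr(code) for code in ascii_codes)
print(restored == a)
```

Both sentences produce the same row, so the row alone cannot say which sentence it came from, whereas the ASCII codes preserve the order of the characters.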
Recall that the similarity between sentences 2 and 3 is 1, because of the word "the".
The words "slept" and "sleeping" are classified into two separate columns.
Words like "sleepy" and "drowsy" that convey more or less the same meaning (synonyms) are also given separate columns.
Is this desirable? Can we come up with more meaningful features, such that the similarity score also becomes meaningful?
The Natural Language Processing field delves into the above questions deeply.
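The limitation raised above can be seen in a short sketch. The first pair of sentences is made up for illustration; the second pair is sentences 2 and 3 from the slides. The function `sentence_similarity` is a hypothetical helper that counts distinct common words, assuming whitespace-separated, lowercase words.

```python
def sentence_similarity(a: str, b: str) -> int:
    """Count the distinct words that appear in both sentences."""
    return len(set(a.split()) & set(b.split()))


# Hypothetical pair with related meaning but no word in common.
related = sentence_similarity("the sleepy dog slept", "a drowsy cat is sleeping")

# Sentences 2 and 3 from the slides: unrelated meaning, one shared word ("the").
unrelated = sentence_similarity("pen is mightier than the sword",
                                "lazy dog slept in the sun")

print(related, unrelated)
```

Under this definition the unrelated pair scores higher than the related pair, which is why more meaningful features are worth looking for.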
Pixels
The more the colours, the more possible values a pixel can take.
The raw pixel values of the image can be used as features for algorithms.
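As a rough sketch of this idea, the tiny image below is a made-up 2 x 3 grid of grayscale values, assuming each pixel is a number from 0 (black) to 255 (white). Reading the grid row by row gives one list of numbers that can serve as the features; this ordering is one common choice, not the only one.

```python
# Hypothetical 2 x 3 grayscale image: each number is one pixel (0 to 255).
image = [
    [0, 255, 0],
    [255, 128, 255],
]

# Read the pixels row by row into a single list of features.
features = [value for row in image for value in row]

print(features)
print(len(features))  # one feature per pixel: 2 rows x 3 columns
```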
Mario To Pixels